ABSTRACT

State-of-the-art extractive multi-document summarization
systems are usually designed without any concern for privacy
issues, meaning that all documents are open to third
parties. In this paper we propose a privacy-preserving
approach to multi-document summarization. Our approach
enables other parties to obtain summaries without learning
anything else about the original documents’ content.

We use a hashing scheme known as Secure Binary Embeddings
to convert document representations containing key
phrases and bag-of-words into bit strings, allowing the
computation of approximate distances, instead of exact ones.
Our experiments indicate that our system yields similar
results to its non-private counterpart on standard
multi-document evaluation datasets.

Categories and Subject Descriptors 

H.3 [Information Storage and Retrieval]; I.2.7 [Natural
Language Processing]: Text analysis; K.4.1 [Computers
and Society]: Public Policy Issues - privacy

General Terms 

Algorithms, Experimentation 

Keywords 

Secure Summarization, Multi-document Summarization,
Waterfall KP-Centrality, Secure Binary Embeddings, Data
Privacy

1. INTRODUCTION

Extractive Multi-document Summarization (EMS) is the
problem of extracting the most important sentences in a
set of documents. State-of-the-art solutions for EMS based
on Waterfall KP-Centrality achieve excellent results. A
limitation of such methods is their assumption that the
input texts are in the public domain. However, problems
arise when these documents cannot be made public.
Consider the scenario where a company has millions of
classified documents organized into several topics. The
company may need to obtain a summary for each topic, but it
lacks the computational power or know-how to do so. At the
same time, it cannot share those documents with a third
party with such capabilities, as they may contain sensitive
information. As a result, the company must obfuscate its
own data before sending it to the third party, a requirement
that is seemingly at odds with the objective of extracting
summaries from it.

In this paper, we propose a new privacy-preserving
technique for EMS based on Secure Binary Embeddings (SBE)
that enables exactly this: it provides a mechanism for
obfuscating not only named entities, but the complete
data, while still achieving near state-of-the-art
performance in EMS. SBE is a kind of locality-sensitive
hashing algorithm which converts data arrays such as
bag-of-words vectors to obfuscated bit strings through a
combination of random projections followed by banded
quantization. The method has information-theoretic
guarantees of security, ensuring that the original data
cannot be recovered from the bit strings.

SBE hashes also provide a mechanism for computing distances
between vectors that are close to one another without
revealing the global geometry of the data, such as the
number of features, consequently enabling tasks such as EMS.
This is achievable because, unlike other hashing methods
which require exact matches for performing retrieval or
classification tasks, SBE allows for near-exact matching:
the hashes can be used to estimate the distances between
vectors that are very close, but provably provide no
information whatsoever about the distance between vectors
that are farther apart. The usefulness of SBE has already
been shown in privacy-preserving important passage retrieval
and speaker verification systems, yielding promising results.

2. RELATED WORK 

2.1 Multi-document Summarization 

Most of the current work in automatic summarization
focuses on extractive summarization. Popular baselines for
multi-document summarization fall into one of the following
general models: Centrality-based [15], Maximal
Marginal Relevance (MMR), and Coverage-based
methods. Additionally, methods such as KP-Centrality
[16], which is centrality and coverage-based, follow more
than one paradigm. In general, Centrality-based models are
used to produce generic summaries, while the MMR family
generates query-oriented ones. Coverage-based models
produce summaries driven by words, topics or events.

We use the Waterfall KP-Centrality method because it is
a state-of-the-art EMS method, but the ideas in this work
could be applied to any other EMS method.

2.2 Privacy-Preserving Methods 

In this work, we focus on creating a method for performing
EMS while keeping the original documents private. To
the best of our knowledge, the combination of these research
lines has only been explored for the single-document
summarization case. However, there are some additional
recent works combining information retrieval and privacy.
Most of these works use data encryption to transfer the data
in a secure way. The problem with these methods is that the
entity responsible for producing the summaries will have
access to the documents' content, while our method guarantees
that no party aside from the owner of the documents will
have access to their content. Another secure information
retrieval methodology is to obfuscate queries, which hides
the user's topical intention but does not secure the content
of the documents [13].

In many areas, the interest in privacy-preserving methods
where two or more parties are involved and wish
to jointly perform a given operation without disclosing their
private information is not new, and several techniques such
as Garbled Circuits (GC), Homomorphic Encryption (HE)
and Locality-Sensitive Hashing (LSH) have been introduced.
However, they all have limitations regarding the EMS task
we wish to address. Until recently, GC methods were
extremely inefficient and difficult to adapt, especially when
the computation of non-linear operations, such as the cosine
distance, is required. Systems based on HE techniques usually
require extremely long amounts of time to evaluate any
function of interest. The LSH technique allows for near-exact
match detection between data points, but does not provide
any actual notion of distance, leading to degradation of
performance in some applications. As a result, we decided to
adopt SBE as the data privacy method for our approach, as it
does not show any of the disadvantages mentioned above for
the task at hand.

3. MULTI-DOCUMENT SUMMARIZATION 

To determine the most representative sentences of a set
of documents, we used a multi-document approach based on
KP-Centrality. This method is adaptable and robust
in the presence of noisy input. This is an important aspect
since using several documents as input frequently increases
the amount of unimportant content.

Waterfall KP-Centrality iteratively combines the summaries
of each document, generated using KP-Centrality,
following a cascade process: it starts by merging the
intermediate summaries of the first two documents, according
to their chronological order. This merged intermediate
summary is then summarized and merged with the summary
of the following document. We iterate this process through
all documents until the most recent one. The summarization
method uses as input a set of key phrases that we extract
from each input document, joins the extracted sets, and
ranks the key phrases using their frequency. To generate
each intermediate summary, we use the top key phrases,
excluding the ones that do not occur in the input document.
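
The cascade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `summarize` is a hypothetical stand-in for the single-document KP-Centrality summarizer, documents are lists of sentences in chronological order, a key phrase "occurs" in an input if it appears as a case-insensitive substring, and the top `top_k` phrases are selected before filtering. None of these names or choices come from the original method description.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical single-document summarizer: (sentences, key_phrases) -> summary
Summarizer = Callable[[List[str], List[str]], List[str]]


def rank_key_phrases(phrase_sets: Sequence[Sequence[str]]) -> List[str]:
    """Join the per-document key-phrase sets and rank phrases by frequency."""
    counts = Counter(p for phrases in phrase_sets for p in phrases)
    return [phrase for phrase, _ in counts.most_common()]


def waterfall(docs: Sequence[List[str]],
              phrase_sets: Sequence[Sequence[str]],
              summarize: Summarizer,
              top_k: int = 40) -> List[str]:
    """Cascade over documents given in chronological order."""
    ranked = rank_key_phrases(phrase_sets)[:top_k]

    def phrases_in(sentences: List[str]) -> List[str]:
        # Keep only the top key phrases that occur in this input (assumption:
        # occurrence is tested as a case-insensitive substring match).
        text = " ".join(sentences).lower()
        return [p for p in ranked if p.lower() in text]

    # Intermediate summary of the oldest document.
    summary = summarize(docs[0], phrases_in(docs[0]))
    for doc in docs[1:]:
        # Summarize the next document, merge, then summarize the merge.
        doc_summary = summarize(doc, phrases_in(doc))
        merged = summary + doc_summary
        summary = summarize(merged, phrases_in(merged))
    return summary
```

Any function with the `Summarizer` signature can be plugged in, which is also what makes it possible to swap the plain summarizer for a privacy-preserving one.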

KP-Centrality extracts a set of key phrases using a
supervised approach and combines them with a bag-of-words
model in a compact matrix representation, given by:

$$
\begin{bmatrix}
w(t_1, p_1) & \cdots & w(t_1, p_N) & w(t_1, k_1) & \cdots & w(t_1, k_M) \\
\vdots & & \vdots & \vdots & & \vdots \\
w(t_T, p_1) & \cdots & w(t_T, p_N) & w(t_T, k_1) & \cdots & w(t_T, k_M)
\end{bmatrix}
\tag{1}
$$

where $w$ is a function of the number of occurrences of each
term $t$ in every passage $p$ or key phrase $k$, $T$ is the
number of terms, $N$ is the number of sentences and $M$ is the
number of key phrases. Then, using
$I \cup K = p_1, \ldots, p_N, k_1, \ldots, k_M = q_1, \ldots, q_{N+M}$,
a support set $S_i$ is computed for each passage $p_i$ using:

$$
S_i = \{ s \in I \cup K : \mathrm{sim}(s, q_i) > \varepsilon_i \wedge s \neq q_i \},
\tag{2}
$$

for $i = 1, \ldots, N + M$. Passages are ranked excluding the set
of key phrases (artificial passages) according to:

$$
\arg\max_{s \in (\bigcup_i S_i) - K} \left| \{ S_i : s \in S_i \} \right|.
\tag{3}
$$

A support set is a group of the most semantically related
passages. These semantic passages are selected using
heuristics based on the passage order method. The metric
that is normally used is the cosine distance.
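
As a rough illustration of support sets and the ranking step, the NumPy sketch below builds one support set per column of the term matrix and ranks passages by how many support sets contain them. It makes simplifying assumptions that are not part of the method as described: a single fixed similarity threshold `eps` replaces the per-passage thresholds chosen by the passage order heuristics, and similarity is plain cosine similarity. The function and parameter names are hypothetical.

```python
import numpy as np


def rank_passages(X: np.ndarray, n_passages: int, eps: float):
    """Rank passages by the number of support sets that contain them.

    X has one row per term and one column per item; the first n_passages
    columns are passages, the remaining columns are key phrases.
    """
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0              # avoid division by zero
    U = X / norms
    sim = U.T @ U                        # cosine similarity between items
    np.fill_diagonal(sim, -np.inf)       # enforce s != q_i
    support = sim > eps                  # support[i, j]: item j is in S_i
    votes = support.sum(axis=0)          # number of support sets containing j
    # Key phrases vote but are not ranked themselves.
    order = np.argsort(-votes[:n_passages], kind="stable")
    return order, support
```

Key phrases act as artificial passages here: they contribute their own support sets, which pulls passages related to them up the ranking, but they are excluded from the final ordering.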

4. SECURE BINARY EMBEDDINGS 

An SBE is a scheme for converting vectors to bit sequences
using quantized random projections. It produces an LSH
method with an interesting property: if the Euclidean
distance between two vectors is lower than a certain
threshold, then the Hamming distance between their hashes is
proportional to the Euclidean distance; otherwise, no
information can be inferred. This scheme is based on the
concept of Universal Quantization (UQ), which redefines
scalar quantization by forcing the quantization function to
have non-contiguous quantization regions. That is, the
quantization process converts an $L$-dimensional vector
$x \in \mathbb{R}^L$ into an $M$-bit binary sequence, where
the $m$-th bit is defined by:

$$
q_m(x) = Q\left( \frac{\langle x, a_m \rangle + w_m}{\Delta_m} \right)
\tag{4}
$$

Here $\langle \cdot, \cdot \rangle$ represents a dot product,
$a_m \in \mathbb{R}^L$ is a "measurement" vector comprising
$L$ i.i.d. samples drawn from $\mathcal{N}(\mu = 0, \sigma^2)$,
$\Delta_m$ is a precision parameter and $w_m$ is a random
number drawn from a uniform distribution over $[0, \Delta_m]$.
$Q(\cdot)$ is a quantization function given by
$Q(x) = \lfloor x \bmod 2 \rfloor$. We can represent the
complete quantization into $M$ bits compactly in vector form:

$$
q(x) = Q\left( \Delta^{-1} (A x + w) \right)
\tag{5}
$$

Here $q(x)$ is an $M$-bit binary vector, which we will refer
to as the hash of $x$, $A \in \mathbb{R}^{M \times L}$ is a
matrix of random elements drawn from
$\mathcal{N}(\mu = 0, \sigma^2)$, $\Delta$ is a diagonal matrix
with entries $\Delta_m$ and $w \in \mathbb{R}^M$ is a vector of
random elements drawn from a uniform distribution over
$[0, \Delta_m]$. The universal 1-bit quantizer of Equation 4
maps the real line onto 1/0 in a banded manner, where each
band is $\Delta_m$ wide. Figure 1 compares conventional scalar
1-bit quantization (left panel) with the equivalent universal
1-bit quantization (right panel).
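
The hash computation is compact enough to sketch directly in NumPy. The version below is a minimal illustration rather than a reference implementation: it assumes a single precision value `delta` shared by all $M$ measurements, and the function names, the `sigma` default and the `seed` argument are choices made for the example. It uses the identity $\lfloor x \bmod 2 \rfloor = \lfloor x \rfloor \bmod 2$ so that each bit is computed with integer arithmetic.

```python
import numpy as np


def make_sbe_params(L, M, delta, sigma=1.0, seed=0):
    """Draw the random projection matrix A and the dither vector w."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, sigma, size=(M, L))   # i.i.d. N(0, sigma^2) entries
    w = rng.uniform(0.0, delta, size=M)       # uniform dither over [0, delta]
    return A, w


def sbe_hash(x, A, w, delta):
    """Return the M-bit hash q(x) = Q((A x + w) / delta)."""
    y = (A @ x + w) / delta
    # floor(y mod 2) equals floor(y) mod 2; the integer form avoids
    # floating-point edge cases for values just below a band boundary.
    return (np.floor(y).astype(np.int64) % 2).astype(np.uint8)
```

With `A = [[1.0]]`, `w = [0.0]` and `delta = 1.0`, the quantizer maps [0, 1) to 0, [1, 2) to 1, [2, 3) to 0 again, and so on, which is the banded behaviour described above.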

Figure 1: 1-bit quantization functions.

The binary hash vector generated by the Universal Quantizer
of Equation 5 has an interesting property: the Hamming
distance $d_H(q(x), q(y))$ between the hashes of
two vectors $x$ and $y$ is correlated to the Euclidean
distance $\|x - y\|$ between the two vectors, if the Euclidean
distance between the two vectors is less than a threshold
(which depends on $\Delta_m$). However, if the distance
between $x$ and $y$ is greater than this threshold,
$d_H(q(x), q(y))$ yields no information about the true
distance between the vectors.

In order to illustrate how this scheme works, we randomly
generated samples in a high-dimensional space ($L = 1024$)
and plotted the normalized Hamming distance between their
hashes against the Euclidean distance between the
respective samples. This is presented in Figure 2. The
number of bits in the hash is also shown in the figures.


[Figure 2 panels: L=1024, M=256 (left) and L=1024, M=4096 (right); horizontal axis: Euclidean distance.]

Figure 2: Embedding behaviour for different values
of $\Delta$ and different amounts of measurements $M$.

We note that in all cases, once the normalized distance
exceeds $\Delta$, the Hamming distance between the hashes of
two vectors ceases to provide any information about the true
distance between the vectors. We will find this property
useful in developing our privacy-preserving multi-document
summarization system.

We also see that changing the value of the precision
parameter $\Delta$ allows us to adjust the distance threshold
until which the Hamming distance is informative. Also,
increasing the number of bits $M$ leads to a reduction of the
variance of the Hamming distance. Yet another interesting
property conjectured for the SBE is that recovering $x$ from
$q(x)$ is NP-hard, even given $A$.
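
The relationship between Euclidean and Hamming distance can be probed numerically with a small experiment. The sketch below is an assumption-laden toy, not the experiment behind Figure 2: it uses unit-variance Gaussian projections, one shared `delta`, and a helper `hamming_at_distance` (a hypothetical name) that places a second vector at an exact Euclidean distance `d` from a random first vector and returns the normalized Hamming distance between their hashes.

```python
import numpy as np


def sbe_hash(x, A, w, delta):
    """M-bit hash: floor((A x + w) / delta) taken modulo 2."""
    y = (A @ x + w) / delta
    return (np.floor(y).astype(np.int64) % 2).astype(np.uint8)


def normalized_hamming(h1, h2):
    """Fraction of bit positions in which two hashes differ."""
    return float(np.mean(h1 != h2))


def hamming_at_distance(d, L=1024, M=4096, delta=1.0, seed=0):
    """Hash two vectors that are exactly d apart and compare the hashes."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(M, L))
    w = rng.uniform(0.0, delta, size=M)
    x = rng.normal(size=L)
    u = rng.normal(size=L)
    y = x + d * u / np.linalg.norm(u)     # ||x - y|| equals d
    return normalized_hamming(sbe_hash(x, A, w, delta),
                              sbe_hash(y, A, w, delta))
```

Calling `hamming_at_distance` for a range of `d` values should show the behaviour described in the text: small values that grow with `d` while the vectors are close, and values hovering around one half once the vectors are far apart, at which point the bits agree or disagree essentially by chance.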

5. SECURE MULTI-DOCUMENT SUMMARIZATION

Our methodology consists of iteratively running the secure
single-document summarization method, which comprises
four stages. In the first stage we obtain a representation
of each document, which is the first step of the
KP-Centrality method. In the second stage we compute SBE
hashes using the document representation. The third stage
ranks the passages, which corresponds to the second step
of the KP-Centrality method. Because we are now working
with SBE hashes instead of the original document
representation, this is performed using the Hamming distance
instead of the cosine distance. Finally, the last stage is to
use the ranks of sentences to obtain the summary.

Our approach for a privacy-preserving multi-document
summarization system closely follows the formulation
presented in Section 3. However, there is a very important
difference in terms of who performs each of the steps of the
single-document summarization method. Typically, the only
party involved, Alice, who owns the original documents,
performs key phrase extraction, combines the key phrases with
the bag-of-words model in a compact matrix representation,
computes the support sets for each document and finally uses
them to retrieve the summaries. In our scenario, Alice does
not know how to extract the important passages from the
document collection and/or does not possess the computational
power to do so. Therefore, she must outsource the
summarization process to another entity, Bob, who has these
capabilities. However, Alice must first obfuscate the
information contained in the compact matrix representation.
If Bob received this information as is, he could use the term
frequencies to infer the contents of the original documents
and gain access to private or classified information Alice
does not wish to disclose. Alice computes binary hashes of
her compact matrix representation using the method described
in Section 4, keeping the randomization parameters $A$ and $w$
to herself. She sends these hashes to Bob, who computes the
support sets and extracts the important passages. Because
Bob receives binary hashes instead of the original matrix
representation, he must use the normalized Hamming distance
instead of the cosine distance in this step, since it is
the metric the SBE hashes best relate to. Finally, he returns
the hashes corresponding to the important passages to Alice,
who then uses them to get the information she desires.
These steps are repeated as many times as needed until the
multi-document summarization process is complete.
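
One round of the exchange between Alice and Bob might look like the following sketch. It is a simplified illustration under several assumptions: a single shared `delta`, a fixed distance threshold `max_dist` for Bob's support sets, and hypothetical function names (`alice_hash`, `bob_select`, `alice_decode`). The columns of `X` are passages followed by key phrases, as in the compact matrix representation.

```python
import numpy as np


def alice_hash(X, M, delta, seed=0):
    """Alice: hash every column of X. A and w are never shared with Bob."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(M, X.shape[0]))
    w = rng.uniform(0.0, delta, size=(M, 1))
    bits = np.floor((A @ X + w) / delta).astype(np.int64) % 2
    return bits.astype(np.uint8)               # one M-bit hash per column


def bob_select(hashes, n_passages, max_dist, top_n):
    """Bob: rank passages using hashes only and return the winners' hashes."""
    H = hashes.astype(np.int64)
    n_bits = H.shape[0]
    # Pairwise normalized Hamming distance between all columns.
    dist = (H.T @ (1 - H) + (1 - H).T @ H) / n_bits
    support = dist < max_dist                  # support[i, j]: j is in S_i
    np.fill_diagonal(support, False)
    votes = support.sum(axis=0)[:n_passages]   # key phrases are not ranked
    best = np.argsort(-votes, kind="stable")[:top_n]
    return [hashes[:, j] for j in best]


def alice_decode(selected, hashes, sentences):
    """Alice: map the returned hashes back to her own sentences."""
    lookup = {hashes[:, j].tobytes(): s for j, s in enumerate(sentences)}
    return [lookup[h.tobytes()] for h in selected]
```

Bob never sees `X`, `A` or `w`; he only compares bit strings. In the multi-document setting this round would be repeated for each step of the cascade, with Alice hashing the representation of each merged intermediate summary.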

6. EXPERIMENTS AND RESULTS 

In this section we illustrate the performance of our
privacy-preserving approach to EMS and how it compares to its
non-private counterpart. We start by presenting the datasets
we used in our experiments, then we describe the experimental
setup and finally we present some results.

To assess the quality of the summaries generated by our
methods, we used ROUGE-1 on the DUC 2007 and TAC
2009 datasets. The DUC 2007^1 dataset includes 45 clusters of
25 newswire documents and 4 human-created 250-word
reference summaries. TAC 2009^2 has 44 topic clusters. Each
topic has 2 sets of 10 news documents. There are 4
human-created 100-word reference summaries for each set. The
reference summaries for the first set are query-oriented and
those for the second set are update summaries. In this work,
we used the first set of reference summaries. We evaluated
the different models by generating summaries with 250 words.
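
For readers unfamiliar with the measure, ROUGE-1 is based on unigram overlap between a generated summary and the reference summaries. The function below is a deliberately simplified sketch of ROUGE-1 recall: it assumes lowercased whitespace tokenization and a plain average over references, and it omits the options of the official ROUGE toolkit (such as stemming or stop-word handling), so its scores are not comparable to the figures reported in this paper.

```python
from collections import Counter
from typing import List


def rouge1_recall(candidate: str, references: List[str]) -> float:
    """Average clipped unigram recall of a candidate against references."""
    cand = Counter(candidate.lower().split())
    scores = []
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        # Each reference unigram is matched at most as often as it
        # appears in the candidate.
        overlap = sum(min(n, cand[tok]) for tok, n in ref_counts.items())
        scores.append(overlap / max(1, sum(ref_counts.values())))
    return sum(scores) / len(scores)
```

For example, the candidate "the cat sat" covers three of the six unigrams in the reference "the cat sat on the mat".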

We present some baseline experiments in order to obtain
reference values for our approach. We generated 250-word
summaries for both the TAC 2009 and DUC 2007 datasets. For
both experiments, we used the cosine and the Euclidean
distance as distance metrics, since the first is the usual
metric for computing textual similarity, but the second is
the one that relates to the Secure Binary Embeddings
technique. All results are presented in terms of ROUGE, in
particular ROUGE-1, which is the most widely used evaluation
measure for this scenario. The results we obtained for
TAC 2009 and DUC 2007 are presented in Table 1.

Metric               TAC 2009   DUC 2007
Cosine distance      0.514      0.370
Euclidean distance   0.489      0.364

Table 1: Reference Waterfall KP-Centrality results
with 40 key phrases, in terms of ROUGE-1.

We considered 40 key phrases in our experiments since it
is the usual choice when news articles are considered [16].
As expected, we notice some slight degradation when the
Euclidean distance is considered, but we still achieve better
results than other state-of-the-art methods such as MEAD,
MMR, Expect n-call@k, and LexRank [3].

leakage   ~5%     ~25%    ~50%    ~75%    ~95%
bpc=4     0.331   0.343   0.338   0.347   0.347
bpc=8     0.339   0.341   0.341   0.352   0.356
bpc=16    0.336   0.348   0.337   0.350   0.351

Table 2: Waterfall KP-Centrality using SBE and the
DUC 2007 corpus, in terms of ROUGE-1.

leakage   ~5%     ~25%    ~50%    ~75%    ~95%
bpc=4     0.475   0.472   0.458   0.478   0.487
bpc=8     0.462   0.472   0.469   0.473   0.486
bpc=16    0.448   0.467   0.462   0.484   0.491

Table 3: Waterfall KP-Centrality using SBE and the
TAC 2009 corpus, in terms of ROUGE-1.

^1 http://www-nlpir.nist.gov/projects/duc/duc2007/tasks.html
^2 http://www.nist.gov/tac/2009/Summarization/

Reported results in the literature include ROUGE-1 =
0.328 and 0.415 using MEAD, ROUGE-1 = 0.327 and 0.392
using MMR, and ROUGE-1 = 0.321 and 0.387 using Expect
n-call@k for the DUC 2007 and TAC 2009 datasets,
respectively. This means that the forced change of metric
due to the intrinsic properties of SBE and the multiple
applications of SBE do not affect the validity of our
approach in any way.

For our privacy-preserving approach we performed
experiments using different values for the SBE parameters.
The results we obtained in terms of ROUGE for the DUC 2007
and the TAC 2009 datasets are presented in Tables 2 and
3, respectively. Leakage denotes the percentage of SBE
hashes for which the normalized Hamming distance $d_H$ is
proportional to the Euclidean distance $d_E$ between the
original data vectors. The amount of leakage is controlled
by $\Delta$. Bits per coefficient (bpc) is the ratio between
the number of measurements $M$ and the dimensionality of the
original data vectors $L$, i.e., bpc = $M/L$. Unsurprisingly,
increasing the amount of leakage (i.e., increasing $\Delta$)
leads to improved summarization results. However, changing
bpc does not lead to improved performance. The reason for
this might be that the Waterfall KP-Centrality method uses
support sets that consider multiple partial representations
of all documents. Even so, the most significant result is
that for 95% leakage there is an almost negligible loss of
performance. This scenario, however, does not violate our
privacy requirements in any way, since although most of the
distances between hashes are known, it is not possible to use
this information to obtain anything about the original
information.

7. CONCLUSIONS AND FUTURE WORK 

In this work, we introduced a privacy-preserving technique
for performing Extractive Multi-document Summarization
that has similar performance to its non-private
counterpart. Our Secure Binary Embeddings based approach
provides secure multi-document representations that allow
sensitive information to be processed by third parties
without any risk of sensitive information disclosure. We also
found it rather interesting to observe such a small
degradation in the results given that we needed to compute
SBE hashes on each iteration of our algorithm.

Future work will explore the possibility of having multiple
entities, rather than a single one, supplying the documents.